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Abstract 

Background: Gene duplications play an important role in the evolution of functional protein diversity. Some 
models of duplicate gene evolution predict complex forms of paralog divergence; orthologous proteins may 
diverge as well, further complicating patterns of divergence among and within gene families. Consequently, 
studying the link between protein sequence evolution and duplication requires the use of flexible substitution 
models that can accommodate multiple shifts in selection across a phylogeny. Here, we employed a variety of 
codon substitution models, primarily Clade models, to explore how selective constraint evolved following the 
duplication of a green-sensitive (RH2a) visual pigment protein (opsin) in African cichlids. Past studies have linked 
opsin divergence to ecological and sexual divergence within the African cichlid adaptive radiation. Furthermore, 
biochemical and regulatory differences between the RH2aa and RH2a(3 paralogs have been documented. It thus 
seems likely that selection varies in complex ways throughout this gene family. 

Results: Clade model analysis of African cichlid RH2a opsins revealed a large increase in the 
nonsynonymous-to-synonymous substitution rate ratio (go) following the duplication, as well as an even larger 
increase, one consistent with positive selection, for Lake Tanganyikan cichlid RH2a(3 opsins. Analysis using the 
popular Branch-site models, by contrast, revealed no such alteration of constraint. Several amino acid sites known 
to influence spectral and non-spectral aspects of opsin biochemistry were found to be evolving divergently, 
suggesting that orthologous RH2a opsins may vary in terms of spectral sensitivity and response kinetics. Divergence 
appears to be occurring despite intronic gene conversion among the tandemly-arranged duplicates. 

Conclusions: Our findings indicate that variation in selective constraint is associated with both gene duplication 
and divergence among orthologs in African cichlid RH2a opsins. At least some of this variation may reflect an 
adaptive response to differences in light environment. Interestingly, these patterns only became apparent through 
the use of Clade models, not through the use of the more widely employed Branch-site models; we suggest that 
this difference stems from the increased flexibility associated with Clade models. Our results thus bear both on 
studies of cichlid visual system evolution and on studies of gene family evolution in general. 

Keywords: Codon substitution model, Visual pigment evolution, Nonsynonymous-to-synonymous substitution rate 
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Background 

Gene duplication is known to play major roles in gen- 
omic and phenotypic evolution, and often precipitates 
divergent evolution of protein structure and function 
[1,2]. A number of models have been proposed to ex- 
plain the retention and evolution of duplicated genes in 
the face of deleterious, pseudogenizing mutations [3-5]; 
these models differ in the predictions they make about 
post-duplication sequence evolution and, consequently, 
in how amenable they are to investigation. The classic 
neofunctionalization model [3], for example, predicts a 
fairly simple pattern of post-duplication protein se- 
quence evolution, with one paralog diverging while the 
other retains the ancestral function under a regime of 
purifying selection. However, other models, such as the 
duplication-degeneration-complementation model [6] 
and the escape from adaptive conflict model [7,8], pre- 
dict more complex forms of divergence. Furthermore, it 
is becoming increasingly recognized that divergent pro- 
tein evolution can occur without duplication— that is, 
among orthologs— and that this can contribute to adap- 
tive phenotypic evolution [9-11], contrary to classical 
assumptions [12]. Distinguishing among models of gene 
duplication, and determining the relative roles played by 
adaptive and non-adaptive processes in protein evolu- 
tion, thus requires approaches that can accommodate 
complex patterns of sequence evolution. 

Recent advances in codon substitution models that ac- 
count for variation in site-specific selective constraint 
among multiple clades or lineages [13,14] provide a 
promising approach for distinguishing among models of 
gene duplication and evolution. Branch-site models 
[15,16] are commonly used to detect the signature of 
strong site-specific positive selection along a pre- 
specified lineage after gene duplication (cf. 'conserved- 
but-different' or 'Type If divergence patterns) [17,18]. 
Clade models [19,20], meanwhile, can be used to detect 
more subtle differences in site-specific selective con- 
straint among entire clades or partitions of a phylogeny 
(cf. 'covarion-like' or 'Type I' divergence patterns) 
[17,18]. Clade models are not restricted to detecting 
strict cases of positive selection and can be used to con- 
sider variation among multiple clades simultaneously 
[21]. As such, Clade models may be better suited to 
detecting complex forms of divergence in selective con- 
straint across gene families than the Branch-site models 
[22]. However, compared to the popular Branch-site 
models, Clade models have been relatively under used. 

Species derived from recent adaptive radiations are in- 
triguing systems for studying patterns of evolution at the 
molecular level, as rapid phenotypic evolution implies a 
comparable degree of change in the underlying genome 
[23,24]. The endemic and diverse cichlid fishes of the 
Rift Valley of eastern Africa are thought to be the result 



of multiple, young, adaptive radiations [25,26]. The high 
numbers of species found within the Rift Valleys lakes 
and rivers, as well as the impressive degree of pheno- 
typic variation present among closely related species, 
make these fishes ideal for studies on functional diversi- 
fication and speciation. Recently, progress has been 
made associating adaptive phenotypic evolution in 
African cichlids with variation at the molecular level, for 
example with regard to jaw morphology or colour pat- 
terning, [27-30]. Perhaps most notably, a number of 
studies have linked ecological and sexual divergence 
among African cichlids to divergence in colour vision 
genes, the opsins [31,32]. Opsins form the protein com- 
ponent of visual pigments, the photosensitive com- 
pounds expressed in the rod and cone photoreceptor 
cells of the retina that absorb and transduce photons of 
light into the biochemical signals that ultimately underlie 
the visual sense [33-35]. Amino acid substitutions that 
affect the opsins retinal chromophore binding pocket 
can alter the pigment's absorbance spectrum, generating 
variation in spectral sensitivity [36,37]. Recent studies of 
African cichlid opsins have linked variation in opsin pro- 
tein sequences and expression patterns to ecologically- 
and sexually-selected divergence among closely related 
populations and species [38,39] and, as a result, African 
cichlids have emerged as model systems for study of the 
molecular biology, evolution, and ecology of opsins 
[31,32]. 

Compared to most vertebrates, African cichlids pos- 
sess a large number of opsin genes, with seven cone 
opsins and one rod opsin [40]. The resulting visual pig- 
ments vary in spectral sensitivity, with the wavelength of 
maximal absorbance (\ max ) ranging from the ultraviolet 
(UV) to the yellow. Most of these opsins are evolutionar- 
ily ancient, with orthologs present in most teleosts, if 
not most vertebrates. However, the green-sensitive 
RH2aa and RH2a|3 opsins are relatively young [41,42], 
and descend from a duplication event specific to the 
African cichlid clade [43]. Spectrophotometric study of 
RH2aa and RH2a(3 pigments from Oreochromis niloticus 
(Nile tilapia) and Metriaclima zebra (Lake Malawi zebra 
mbuna), expressed in vitro, revealed functional diver- 
gence between the paralogs, with the RH2aa pigments 
red-shifted (A max ~ 528 nm) compared to the RH2a(3 
pigments (A max ~ 518 nm) [41,42]. However, it has 
proven difficult to broadly survey for variation in 
RH2aa/(3 A max via microspectrophotometry (MSP) due 
to their fairly close A max values and the noise inherent in 
MSP [32], Regulatory differences are also apparent 
[42,44] but the high level of sequence similarity between 
these paralogs (-95% identical at the nucleotide level) 
makes quantitative PCR studies challenging. As a result 
of these methodological difficulties, it is sometimes sim- 
ply assumed that the RH2aa and RH2ap paralogs are 
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sufficiently similar to justify treating them equivalently 
[45,46]. However, there is reason to believe that compar- 
ably small differences in X max can be ecologically and 
evolutionarily important in African cichlids [38], and 
whether or not these opsins are functionally equivalent 
from the perspective of African cichlid visual biology 
and fitness is not clear. Comparative sequence analysis 
may be able to provide useful insights into this system. 

Given the important role vision plays in the African 
cichlid adaptive radiation [31,32] as well as the import- 
ant role gene duplication plays in functional diversifica- 
tion in general [3], we set out to explore patterns of 
sequence evolution associated with the RH2aa-RH2a|3 
gene duplication event using both Branch-site and Clade 
codon-substitution model approaches. We document 
complex patterns of divergence among duplicated 
African cichlid RH2aa and RH2a|3 opsins, reflecting both 
paralog-specific and species-specific processes. Import- 
antly, positive selection was documented not using 
Branch-site models, but the less widely employed Clade 
models. We discuss the implications of our findings in 
light of gene duplication theory, cichlid visual ecology, 
and opsin structure and function. 

Methods 

Phylogenetic analyses were carried out on a data set 
of 48 fish RH2 opsin sequences from 29 species; species 
names and accession numbers are provided in 
Additional file 1: Figure SI. Translated amino acid 
sequences were assembled in MEGA 4 [47] and aligned 
using ClustalW [48], after which the extreme N- and C- 
termini of the opsin sequences were trimmed, leaving an 
alignment 343 codons in length. Bayesian phylogenetic 
analysis was carried out using MrBayes 3.2 [49] using 
the GTR+I+r nucleotide substitution model, which was 
selected based on AIC rank, as calculated by MrModelt- 
est 2.2 [50]. Four runs, each consisting of four chains 
(three heated and one cold), were run for 5 x 10 6 gen- 
erations, sampling every 100 generations. The first 25% 
of the samples were considered 'burn-in' and discarded. 
Adequate sampling and convergence were assured by 
ensuring that (1) the standard-deviation of split frequen- 
cies was less than 0.01 by the end of the analysis, (2) 
post-scale reduction factors were approximately 1.000 
for all parameter estimates and topological partitions, (3) 
parameter estimate-by-generation plots were stationary, 
and (4) effective sample sizes were greater than 100. 
These checks were carried out through direct examin- 
ation of the MrBayes output file and using Tracer 1.5 
[51]. A codon-partitioned approach was employed as 
well, assuming separate GTR+I+r substitution models 
for each of the three codon positions. Maximum likeli- 
hood (ML) phylogenetic analyses were carried out using 
PhyML 3.0 [52] assuming the GTR+I+r nucleotide 



substitution model. Ten random trees plus a tree in- 
ferred using the BIONJ algorithm were used as starting 
trees, and both nearest neighbour interchange (NNI) 
and subtree-pruning and regrafting (SPR) tree-search 
approaches were used to explore tree space. Node sup- 
port for the ML tree was evaluated using the SH-like ap- 
proximate likelihood ratio test approach [52]. For both 
Bayesian and ML analyses, four rate categories were 
assumed for the +r distribution used to described 
among-site rate variation [53]. 

After estimating the fish RH2 phylogeny, we focused 
on the RH2a opsin clade for subsequent molecular evo- 
lutionary analyses. Specifically, we extracted and ana- 
lyzed the monophyletic sub-tree containing the 
duplicated African cichlid RH2aa and RH2a(3 opsins plus 
five outgroup RH2a sequences; species names and acces- 
sion numbers for these sequences can be found in 
Figure 1 and Additional file 1: Figure SI. Due to uneven 
taxonomic coverage in online genetic databases, the 
RH2ap clade possesses opsin sequences from three add- 
itional species compared to the RH2aa clade. The three 
extra sequences in the RH2ap clade are all derived from 
Lake Tanganyikan cichlids, which were surveyed prior to 
the discovery of the RH2a duplication event [54]; 
whether these species have retained or lost their RH2aa 
opsins is currently not known, though current phylogen- 
etic hypotheses for Lake Tanganyikan cichlids [55] sug- 
gest that multiple deletions would be required for all 
three species to truly lack RH2aa opsins in their respect- 
ive genomes (see Additional file 1: Figure S2). Note that 
the RH2a(3 sequences from Ophthalmotilapia ventralis 
and Neolamprologous brichardi were obtained from 
cDNA even though relative RH2a gene expression is 
skewed towards the RH2aa paralog in Oreochromis nilo- 
ticus and Lake Malawi haplochromine cichlids [42,44]. 
RH2a opsin sequences are available for many other Afri- 
can cichlids, but often differ from one another at just 
one or two positions; because including all such 
sequences would greatly increase computational time 
without adding much information, we chose to focus on 
key sequences representing well-studied species and key 
cichlid lineages. 

We explored patterns of selective constraint across 
this RH2a data set through the use of codon substitution 
models that include the nonsynonymous to synonymous 
substitution rate ratio (co or dN/dS) as a parameter 
[56,57]. The co ratio speaks to the form and strength of 
selective constraint operating on protein-coding DNA: 
0 < co < 1 is consistent with purifying selection (nonsy- 
nonymous substitutions are accumulating more slowly 
than synonymous substitutions), co = 1 suggests neutral- 
ity (nonsynonymous and synonymous substitutions are 
accumulating at equivalent rates), and co > 1 indicates 
positive selection (nonsynonymous substitutions are 
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Figure 1 RH2a opsin gene tree showing the African cichlid-specific duplication event that produced the RH2aa (thick, dotted, blue 
branches) and RH2aP (thick, solid/dashed, red branches) opsins. Dashed red branches indicate RH2a(3 lineages from Lake Tanganyikan 
cichlids. Numbers adjacent to nodes indicate clade posterior probability, and branch lengths indicate expected number of substitutions per 
nucleotide site. Branch model analyses (inset, upper left) revealed that co was significantly elevated along the Lake Tanganyikan branches 
compared to the rest of the RH2a(3 clade. Species code abbreviations indicate the first letter of the genus and first two letters of each species: Cfr, 
Crenicichlo frenata; Lgo, Luconio goodei; Mau, Melonochromis auratus; Mze, Metrioclima zebra; Nbr, Neolomprologus brichardi; Ola, Oryzios latipes; 
Oni, Oreochromis niloticus; Ove, Ophthalmotilapia ventralis; Ppu, Pundomilia pundamilia; Pre, Poecilia reticulata; Tdu, Tropheus duboisi; Tin, 
Tramitichromis intermedius. Genbank accession numbers are included alongside the species codes. 



accumulating faster than synonymous substitutions). 
Three different approaches were used, each of which 
makes different assumptions about how co varies across 
the alignment and/or across the phylogeny: (1) Branch 
models [58], (2) Branch-site models [16], and (3) Clade 
models [20]. Models were fit to the data using the 
codeml program of the PAML 4.2 software package 
[59]. Within each of these model classes, likelihood 
ratio tests (LRTs) were used to compare the fit of com- 
plex models against simpler, nested models [60,61]. 
LRTs were carried out by comparing twice the differ- 
ence in In likelihood scores of nested models against a 
X 2 distribution with the degrees of freedom equal to the 
number of extra parameters estimated by the more 
complex model. However, as LRTs can only be used to 
compare nested models, we also used AIC scores [62] 
to help convey the relative fit of the different Clade 
models applied to our fish RH2a data set. In addition to 
the selection pressure parameters (dN/dS, co; propor- 
tions, p), the transition-to-transversion substitution rate 
ratio (k) and branch lengths were optimized as well. 
Codon frequencies were approximated using the F3x4 



calculation. Each model was fit to the data multiple 
times from different starting parameter values to help 
ensure local optima were avoided, with either co or k 
perturbed, as needed, depending on the particular 
model. As these methods assume that the aligned 
sequences are related by a phylogenetic tree, not by a 
reticulating network, the signature of gene conversion 
within this data set was searched for using Phi [63], as 
implemented in PhiPack. Significance was assessed ei- 
ther assuming a normally approximated null distribu- 
tion or via a permutation approach (with 1000 
permutations), and analyses were run with the window 
size left at the recommended default value of w = 100 
and at w = 50. The results were qualitatively equivalent 
in spite of these changes; as such, only results derived 
using the normal approximation and w = 100 are 
shown. Dot plots were created using eBioX 1.5.1 [64]. 

Branch models [65] assume that the co ratio varies 
across branches of the phylogeny (specified a priori) but 
that it is invariant across sites of the alignment; compar- 
ing complex Branch models (i.e., ones with multiple co 
ratios) against simpler, nested models tests whether co 



Weadick and Chang BMC Evolutionary Biology 2012, 12:206 
http://www.biomedcentral.eom/l 47 1 -2 1 48/1 2/206 



Page 5 of 1 7 



varies significantly between sections of the phylogeny. 
Since Branch models make the unrealistic assumption of 
among-site homogeneity, they often lack power to detect 
subtle patterns of divergence across phylogenies, and we 
conducted post-hoc Branch model analysis simply to 
help demonstrate our Clade model partitioning schemes 
(described below). Branch-site models and Clade models 
similarly allow for variation in co among pre-specified 
branches of the phylogeny, but, unlike Branch models, 
also incorporate among-site variation in selective con- 
straint. The signature of positive selection (co > 1) along 
pre-specified lineages was tested for using the Branch- 
site approach of Zhang, Nielsen, and Yang [16]. This 
model assumes that there are four classes of sites and 
that the phylogeny can be divided into 'background' and 
'foreground' lineages based on an a priori hypothesis 
about when positive selection may have occurred. The 
first two classes of sites correspond to codons that ex- 
perience selection consistently across the entire phyl- 
ogeny, experiencing either purifying selection (0 < co 0 < 
1) or neutral pressure ((x>i = 1), respectively. The final 
two classes of sites correspond to codons that experience 
purifying or neutral selection on the background 
lineages, but positive selection (co 2 > 1) on the foreground 
lineage. These four site classes comprise proportions p 0 
(universally-purifying site class), pi (universally-neutral 
site class), poy 2 /(l - P2) (purifying- to-positive selection 
site class), and p\p 2 l{l - p 2 ) (neutral-to-positive selection 
site class) of the total data set (where p 2 = 1 - Po - Pi). 
The goodness-of-fit of this Branch-site model is estab- 
lished by comparing it via a LRT against a constrained null 
model where co 2 = 1; this LRT thus tests for the presence 
of positively selected sites. 

The signature of divergent selective constraint across 
the phylogeny was tested for using the Clade model C 
(CmC) approach of Bielawski and Yang [20] as modified 
by Yang, Wong, and Nielsen [66]. In its simplest form, 
CmC assumes that the branches of the phylogeny can be 
divided into two partitions, the 'background' branches 
and the 'foreground' branches. CmC accommodates 
among-site variation in the substitution process by as- 
suming three site classes. As with the Branch-site ap- 
proach described above, the first two classes of sites 
correspond to codons that experience selection consist- 
ently across the entire phylogeny, experiencing either 
purifying selection (0 < co 0 < 1) or neutral pressure (o^ = 
1). The third site class accounts for codons that experi- 
ence divergent selection pressures in different, pre- 
defined partitions (i.e., co 2 > 0 for the background 
branches and co 3 > 0 for the foreground branches). These 
site classes correspond to proportions p 0 , p lf and p 2 of 
the total data set (where p 2 = 1 - Po - P\). Models Mia 
and M2a_rel, neither of which incorporates among- 
lineage variation in co, were used as null models to test 



for the presence of divergently selected sites. Mia is the 
standard null model for CmC analyses [66]; Mia pos- 
sesses only two site classes: one for sites subject to puri- 
fying selection (0 < co 0 < 1), and one for neutral sites 
(to! = 1). However, our previous analyses of simulated 
data sets revealed that the CmC versus Mia LRT is 
prone to false positive test results when faced with 
moderate among-site variation in selective constraint. 
We therefore also employed our newly proposed 
M2a_rel null model for CmC analyses [67]; M2a_rel 
possesses purifying and neutral site classes, like the Mia 
model, but also possesses a third site class under which 
a single co ratio is estimated for all branches of the phyl- 
ogeny (co 2 > 0). We checked the robustness of our 
results to slight changes in model framework by reana- 
lyzing the data using the Clade model D (CmD) frame- 
work [20]; like CmC, CmD assumes three site classes 
with the final class modeling divergent selection among 
clades but, unlike CmC, no constraints are placed on the 
co estimates for any of the site classes. 

Yoshida et al. [21] recently extended CmC to allow for 
more than two tree partitions, each with a separately esti- 
mated co ratio. We refer to this as a 'multi-clade' ap- 
proach, and we used this approach to examine complex 
patterns of divergence in selection across the phylogeny 
by comparing such models against simpler, nested mod- 
els with fewer tree partitions. Assuming a phylogeny can 
be partitioned into three clades (X, Y, and Z), the multi- 
clade approach could be used to estimate three separate 
co ratios for the three tree partitions (co 2 > 0 for clade X, 
co 3 > 0 for clade Y, and co 4 > 0 for clade Z). Comparing 
this model against a simpler, null model with only two 
tree partitions (say, co 2 > 0 for clade X, and co 3 > 0 for 
both clade Y and clade Z) would constitute a test of 
whether selective constraint is equivalent in clades Y and 
Z (i.e., whether or not co 3 * co 4 ). This LRT's null model is 
formed by imposing a single, non-boundary constraint 
on the alternative model (i.e., the constraint that co 3 = 
co 4 ), reducing model size by one estimated parameter. As 
a result, the null distribution for this LRT should follow a 
X 2 distribution with one degree of freedom. We therefore 
generated simulated data sets assuming a CmC frame- 
work (with two tree partitions), and used these data sets 
to evaluate this multi-clade LRT's false -positive rate. 

Simulated data sets were generated using the evolver 
program of the PAML 4.2 software package [68]. Follow- 
ing our earlier simulation study of CmC LRTs [67], 100 
data sets of 10 taxa and 500 codons were simulated 
under CmC assuming the topology, branch lengths, and 
partitioning shown in Figure 2a. Half of the codons of 
each simulated data set (p 0 = 0.5) were generated assum- 
ing strong purifying selection (co = 0.0), p x = 0.20 were 
generated assuming neutrality ((x>i = 1.0), and the 
remaining codons (p 2 = 0.3) were simulated assuming 
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Figure 2 Simulation-based analyses were used to establish the false-positive rate of the LRT comparing CmC assuming three tree 
partitions against a null version of CmC assuming only two tree partitions, (a) The two-partition tree used for simulating null data sets, (b) 
The three-partition tree used as the LRT's 'alternative' model, (c) Histogram showing the distribution of LRT test-statistics from analysis of 100 data 
sets simulated under the null model, (d) The same data as in (c), but here plotted as an empirical cumulative density function. For both (c) and 
(d), the solid, curved line shows the expected x? distribution. 



divergent selection pressure between the solid (co 2 = 
0.15) and dashed (co 3 = 0.65) tree partitions. Additional 
parameters for the simulations were set as follows: the 
transition to transversion substitution rate ratio (k) was 
set to k = 2.0; the total tree length (TL) was set to TL = 
3.0 substitutions per codon; and the equilibrium fre- 
quency for each sense codon was set to 1/61. For null 
model analyses, we ran CmC assuming the two parti- 
tions shown in Figure 2a (i.e., correct partitioning), while 
for alternative model analyses, we ran CmC assuming 
the three partitions shown in Figure 2b (i.e., overly com- 
plex partitioning). Branch lengths were freely estimated, 
but k and the equilibrium codon frequencies were fixed 
at their simulated values. Analyses were run multiple 
times from different starting co values to help detect and 
avoid local optima in the likelihood surface, as described 
in Weadick and Chang [67]. The LRT test statistics for 
each of the 100 simulated data sets— twice the difference 
in In likelihood scores from the alternative and null 



model analyses— were calculated, compiled, and com- 
pared against the expected Xi null distribution. Observed 
and expected cumulative density functions were com- 
pared via a one-sided Kolmogorov-Smirnov test and a 
one-sided binomial test. We carried out additional ana- 
lyses on the 100 simulated data sets where the phylogen- 
etic partitioning of the alternative and null models was 
misspecified, with branch 10 of the tree shown in 
Figure 2a,b included in the 'dashed' partition, rather 
than the correct solid' partition. 

Following Clade model analysis, a Bayes empirical 
Bayes (BEB) approach was used to identify specific 
codons with high posterior probability (PP) of being in 
the 'divergent selection' site class [66]. BEB-identified 
sites were then mapped on to the three-dimensional 
crystal structure of bovine rhodopsin (PDB accession 
lul9) [69] using MacPyMol (Delano Scientific). The 
phylogenetic location of specific amino acid substitu- 
tions was inferred by using ancestral reconstruction 
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methods [70] to estimate the most probable residue at 
each node under the WAG+F+r amino acid substitution 
model [53,71]. Site numbering is based on alignment 
against bovine RH1 opsin (rhodopsin). 

Results 

Phylogenetic analyses recovered reciprocally monophy- 
letic clades of African cichlid RH2aa and RH2ap opsins 
and a sister relationship between the African cichlid 
RH2a opsin clade and the RH2a opsin of the Neotropical 
cichlid Crenicichla frenata (Figure 1). Trees estimated 
via the codon-partitioned Bayesian approach and the 
ML approach were highly similar, and only differed with 
respect to the arrangement of a few of the highly similar 
haplochromine cichlid RH2a opsin sequences; the full 
Bayesian (codon-partitioned) and ML trees are provided 
in Additional file 1: Figure SI. The non-partitioned and 
codon-partitioned Bayesian methods yielded trees with 
equivalent branching patterns (result not shown). For 
molecular evolutionary analyses, we focused on the sub- 
tree corresponding to the RH2a opsins of cichlids and 
those of closely related atherinomorph fishes (guppy, 
Poecilia reticulata; bluefin killifish, Lucania goodei; and 
medaka, Oryzias latipes) from the codon-partitioned 
Bayesian phylogeny (Figure 1). Fitting the simple M0 
codon substitution model to this data set provided an 
overall co estimate of 0.1476, indicating that purifying se- 
lection is the predominant force shaping the evolution 
of these RH2a opsin sequences. Estimates of dS, calcu- 
lated for each branch given the branch length, the num- 
ber of nonsynonymous and synonymous sites in the 
sequence, and the overall co estimate calculated under 
the M0 model, were always well below one, indicating 
that saturation of synonymous substitutions is unlikely 
to adversely affect our analyses. 

Gene conversion between paralogs violates the 
assumptions of ML methods for estimating co, and can 
even cause spurious signatures of positive selection 
under some conditions [72]. It has previously been sug- 
gested that gene conversion is unlikely to affect the 
African cichlid RH2a opsins because the paralogs are 
arranged in a head-to-head manner [42]. However, re- 
cent work on rodent genomes has shown that duplicates 
oriented in such a fashion are as prone to gene conver- 
sion as those oriented in the typical head-to-tail manner 
[73]. We therefore used Phi to test for local correlations 
in phylogenetic incompatibility across the data set (i.e., 
across the alignment). While we did detect a significant 
signature of recombination (P = 0.028), this reflected 
gene conversion between the distantly related RH2B and 
RH2C paralogs of the medaka (Oryzies latipes), not gene 
conversion between the African cichlid RH2a opsins. 
Unlike the African cichlid RH2a paralogs, these dupli- 
cated medaka opsins are oriented in a head-to-tail 



manner, indicating an independent duplication event 
[40]. Visual inspection revealed that the third and fourth 
introns of the medakas RH2B and RH2C paralogs are 
highly similar, while the first and second introns are di- 
vergent (Additional file 1: Figure S3), and removing ei- 
ther (or both) of the medaka RH2B/C paralogs from the 
data set eliminated the signature of conversion (P > 0.50 
in all cases). Furthermore, strong gene conversion 
should result in sequences clustering by species, not by 
paralog, and this pattern is not observed within the Afri- 
can cichlid RH2a opsin portion of the estimated phyl- 
ogeny (Figure 1). These results suggest that gene 
conversion has not had a notable impact on the evolu- 
tionary trajectories of the coding sequences of the dupli- 
cated African cichlid RH2a opsins and thus should not 
have adverse effects on our analyses. Visual inspection 
of the RH2aa and RH2a(3 opsins of Oreochromis niloti- 
cus [74], however, revealed that introns one and four are 
highly similar, while introns two and three are relatively 
divergent (Additional file 1: Figure S3). This may indi- 
cate that natural selection is maintaining distinct opsin 
coding sequences in spite of at least some intronic gene 
conversion, as occurs in human red and green opsins 
[75]. Alternatively, the fact that these introns are highly 
similar may reflect strong purifying selection on non- 
coding motifs with roles in gene regulation or splicing 
control; more intronic RH2a sequence data, obtained 
from numerous species, will be needed to address these 
possibilities. 

Several amino acid substitutions occurred along 
the RH2aa and RH2ap post-duplication branches 
(Additional file 1: Table SI), but in neither case did we 
obtain evidence indicative of adaptive divergence follow- 
ing gene duplication using Branch-sites methods 
(Table 1). The Branch-site LRT for positive selection [16] 
was applied four times, with different branches set as the 
foreground partition: (1) the RH2aa post-duplication 
branch; (2) the RH2ap post-duplication branch; (3) both 
post-duplication branches, combined; and (4) the branch 
joining the paralogous clades in a reduced data set for 
which outgroup sequences were excluded. Our Branch- 
site tests remained non-significant even when a more lib- 
eral null distribution (a 50:50 mixture of 0 and Xi) was 
employed instead. We do note, however, that the substi- 
tutions inferred along the RH2ap post-duplication 
branch were densely clustered (substitutions occurred at 
six sites in a 13 site stretch: sites 27-39; Additional file 1: 
Table SI). This region of the protein has recently been 
proposed to serve as the entry channel for the retinal 
chromophore into the proteins binding pocket [76]. 
One of these substitutions (I36F) involved the replace- 
ment of a non-aromatic isoleucine residue with an aro- 
matic phenylalanine residue at a site located at the base 
of the proposed entry channel; aromatic residues are 
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Table 1 Parameter estimates, log-likelihood scores, and likelihood ratio test (LRT) P values obtained from Branch-site 
analyses of the RH2a data set 

Model (n.p.) 



Site class 0 
w 0 Po 



Site class 1 



Site class 2 

w 2 p 2 



Inl 



LRT P 
value 



BrS-A a (37) 


0.0235 


0.8241 


I 0.1759 


1 .0000 


0.0000 


1.7670 


-3789.9781 


1 .0000 


BrS-N a (36) 


0.0235 


0.8241 


I 0.1759 


1 


0.0000 


1.7670 


-3789.9781 




BrS-A (3 (37) 


0.0230 


0.6151 


I 0.1296 


1 .0000 


0.2553 


1 .7652 


-3789.1087 


1 .0000 


BrS-N (3 (36) 


0.0230 


0.6151 


I 0.1296 


1 


0.2553 


1 .7652 


-3789.1087 




BrS-A aP (37) 


0.0234 


0.8205 


I 0.1746 


2.8146 


0.0049 


1 .7670 


-3789.9535 


0.8643 


BrS-N aP (36) 


0.0235 


0.8182 


I 0.1743 


1 


0.0076 


1 .7669 


-3789.9681 




BrS-A aP reduced (27) 


0.0000 


0.7487 


I 0.2514 


1 .0000 


0.0000 


2.2643 


-2200.6106 


1 .0000 


BrS-N aP reduced (26) 


0.0000 


0.7487 


I 0.2514 


1 


0.0000 


2.2643 


-2200.6106 





NOTE — n.p number of parameters. 

thought to assist in retinal uptake [76], suggesting that 
this change may enhance visual pigment regeneration 
rate. 

Given our initial findings using Branch-site methods, 
we explored alternative signatures of divergence in se- 
lective constraint using the Clade model C (CmC) ap- 
proach [20,21,66]. First, we built on our earlier 
simulation-based study of CmC LRTs [67] in order to 
evaluate the appropriateness of the Xi null distribution 
for the multi-clade LRT comparing CmC with three tree 
partitions against a simpler, nested version of CmC with 
only two tree partitions [21]. The empirical and expected 
distributions of LRT test statistics from our simulation 
analyses are plotted in Figures 2c and 2d, and it can be 
seen that the two distributions follow one another quite 
closely. Parameter estimates under the alternative and 
null model are summarized in Additional file 1: Figure 
S4. Although eight of the 100 tests indicated positive test 
results, this value is not significantly different from the 
expected 5% (one-sided binomial test: P = 0.1280). Fur- 
thermore, a Kolmogorov-Smirnov test comparing the 
observed and expected cumulative density functions was 
non-significant, indicating a good fit to the expected null 
distribution (one-sided test for an empirical distribution 
falling below the x? distribution: D = 0.0609; P = 
0.4759). Our simulation results therefore suggest that 
the LRT comparing CmC with three tree partitions 
against a simpler, nested version of CmC with two tree 
partitions can be evaluated using a Xi null distribution, 
though we note that further analyses are needed to fully 
evaluate the reliability and power of this approach. Re- 
cently, Gossmann and Schmid [77] carried out similar 
simulation-based analyses of this LRT, concluding that it 
has fair-to-good power and a relatively low false positive 
rate; however, these analyses were carried out on smaller 
data sets than we employed here. 

Additional analyses of the simulated data sets were 
carried out using an incorrect phylogenetic partitioning 
strategy. Given the tree shown in Figure 2b, we treated 



the branch leading to tip #10 as part of the 'dashed' par- 
tition, rather than the correct solid' partition, and we 
then tested for divergence (i.e., co 3 * co 4 ) by comparing 
the 'dotted' partition against the expanded, heteroge- 
neous 'dashed' partition. Based on the simulated branch 
lengths and co parameter values, misspecifying the phylo- 
genetic partitioning in this way should reduce the co 3 es- 
timate (to co 3 ~ 0.52) compared to the co 4 estimate (co 4 ~ 
0.65), generating a signature of divergence. We found 
that 23 of the 100 LRTs were significant, suggesting 
weak power for this test under the given conditions. 
Maximum likelihood estimates of the parameter values 
appeared to be accurate for most of the parameters (co 0 , 
co 2 , co 3 , p 0 , p 2 ), though slightly upwardly biased for the 
co 4 estimate (Additional file 1: Figure S4). Given these 
results, we conclude that the properly specified multi- 
clade LRT is statistically sound, but caution that care 
must be taken when designating partitions for Clade 
model analyses. Specifically, we recommend that parti- 
tioning choices be carefully based on external considera- 
tions, such as gene duplication theory, taxonomic 
sampling, or phylogenetic patterns of niche variation. Of 
course, additional simulation-based studies of this LRT's 
properties will be beneficial, and future work should ad- 
dress the performance of this test using larger data sets 
and assuming more complex evolutionary scenarios 
designed to challenge the assumptions of the alternative 
and null models. 

We first applied CmC with either the entire RH2aa 
clade ('CmC a') or the entire RH2a|3 clade ('CmC |3') set 
as the foreground partition (Figure 1); all other branches 
comprised the background partition. In both cases, the 
(0 ratio for the divergent selection site class changed 
from less than one on the background lineages (co 2 ~ 
0.55-0.65) to greater than one on the foreground 
lineages (co 3 ~ 1.15-1.53), with approximately 21% of the 
data set assigned to the divergent selection site class 
(Table 2). These estimates suggest that selective con- 
straint was relaxed following duplication, and may 
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Table 2 Parameter estimates, log-likelihood scores, and AIC weights obtained from Clade model analyses of the RH2a 
data set 



Model (n.p.) 


SCO 




SC 1 


SC 2 




K 


Inl 


A AIC 


AIC weight 






Po 




U) 2 / w 3/ U) 4 


Pi 










CmC qPmvr & P T (39) 


0.0142 


0.7890 


1 0.0000 


oo 2 : 0.4988 
oo 3 : 1.0919 
oo 4 : 2.6122 


0.2110 


1.7071 


-3773.9921 




0.6394 


CmC aP (38) 


0.0151 


0.7925 


1 0.0000 


oo 2 : 0.5062 
oo 3 : 1 .3889 


0.2075 


1 .7089 


-3776.0852 


2.19 


0.2143 


CmC a & P (39) 


0.0147 


0.791 1 


1 0.0000 


oo 2 : 0.5037 
oo 3 : 1.1308 
oo 4 : 1 .5420 


0.2089 


1 .7085 


-3775.7671 


3.55 


0.1084 


CmC P (38) 


0.0127 


0.7838 


1 0.0000 


oo 2 : 0.5471 
oo 3 : 1.5315 


0.2163 


1.7033 


-3778.3671 


6.75 


0.0219 


CmC p T (38) 


0.0115 


0.7789 


1 0.0000 


oo 2 : 0.5882 
oo 3 : 2.5296 


0.2211 


1.6971 


-3778.6874 


7.39 


0.0159 


CmC a (38) 


0.0138 


0.7882 


1 0.0000 


oo 2 : 0.6458 
oo 3 : 1.1498 


0.2119 


1 .6993 


-3784.5357 


19.09 


0.0000 


M2a_rel (37) 


0.0089 


0.7639 


1 0.0512 


oo 2 : 0.5401 


0.1849 


1 .6956 


-3785.9512 


19.92 


0.0000 


Mia (35) 


0.0235 


0.8241 


1 0.1759 






1 .7669 


-3789.9781 


23.97 


0.0000 



NOTE— n.p. = number of parameters; SC = site class. 
Models are ranked according to AIC score. 



indicate the action of weak positive selection as well 
Both models fit the data significantly better than the 
Mia null model, but 'CmC a did not fit the data signifi- 
cantly better than the more reliable M2a_rel null model 
(Table 3). However, this 'CmC a' model includes the 
RH2ap clade as part of the 'background' partition along- 
side the outgroup orthologs of non- African cichlids. This 
may be inappropriate, as the 'CmC p' vs. M2a_rel LRT 
suggested a large increase in co for the RH2ap clade. To 
evaluate this possibility, we employed a multi-clade 
CmC approach assuming three partitions: the RH2aa 
branches, the RH2ap branches, and the outgroup ortho- 
logous branches. Using this model, which we call the 
'CmC a & p' model, it was estimated that approximately 
21% of the data set evolved under divergent selective 
constraint across the three partitions, with a co ratio less 
than one along the outgroup branches (co 2 = 0.50), 
slightly above one along the RH2aa branches (co 3 = 
1.13), and somewhat higher still along the RH2ap 
branches (co 4 = 1.54) (Table 2). Comparison against the 
'CmC p' model yielded a significant LRT result (Table 3), 
indicating that selective constraint did indeed change 
after duplication in the RH2aa clade; this difference only 
became statistically significant once the elevated co ratio 
for the RH2ap clade was accounted for in the model. 
Comparison against a different null model, one where all 
African cichlid RH2a branches were considered as a sin- 
gle foreground partition (termed the 'CmC ap' model), 



revealed that the co ratios estimated for the RH2aa and 
RH2ap partitions under the 'CmC a & p' model were 
not significantly different from one another (Table 3), 
with a common co ratio estimate of co 3 = 1.39 (Table 2). 

Our RH2a data set was taxonomically unbalanced, 
possessing three Lake Tanganyikan cichlid RH2ap 
sequences without corresponding RH2aa paralogs, and 
our Branch model analyses revealed that selective con- 
straint was significantly different between the branches 
corresponding to these Lake Tanganyikan cichlid opsin 
RH2ap lineages and the remaining RH2ap branches 
(Figure 1, inset). It is therefore possible that the increased 
co ratio observed for the RH2ap clade in our CmC analyses 
was driven by these Lake Tanganyikan RH2ap sequences. 
In lieu of removing these taxonomically-unbalancing 
sequences from the alignment, which would represent an 
unfortunate loss of data, we carried out further analyses 
using the recently developed multi-Clade approach [21]. 
Specifically, we added a third partition to the 'CmC ap' 
model that separated the Lake Tanganyikan RH2ap 
branches (p T ) from the Lake Malawi, Lake Victoria, and 
riverine RH2aa and RH2ap branches (ap MV R)- Using this 
model, which we call the 'CmC ap MV R & Pt model, it was 
estimated that approximately 21% of the data set evolved 
under divergent selective constraint, with a co ratio less 
than one along the outgroup branches (co 2 = 0.50), slightly 
above one along the ap MV R branches (co 3 = 1.09), and sub- 
stantially higher along the p T branches (co 4 = 2.61) 
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Table 3 Likelihood ratio test (LRT) P values for nested Clade model C (CmC) comparisons 


Null model 


Alternative model 










Mia 


M2a_rel 


CmC a CmC (3 


CmC a(3 


CmC p T 


CmC qPmvr & P T 


< 0.0001 (4) 


< 0.0001 (2) 




0.0408 (1) 


0.0022 (1) 


CmC a & P 


< 0.0001 (4) 


< 0.0001 (2) 


< 0.0001 (1) 0.0226 (1) 


0.4251 (1) 




CmC aP 


< 0.0001 (3) 


< 0.0001 (1) 








CmC a 


0.0124 (3) 


0.0925 (1) 








CmC P 


< 0.0001 (3) 


0.0001 (1) 








CmC p T 


< 0.0001 (3) 


0.0001 (1) 









Degrees of freedom for each LRT are indicated in parentheses. 



(Table 2). Out of the Clade models considered, the 'CmC 
a pMVR & Pt model was the best fitting according to AIC 
scores, and comparison against the 'CmC a|3' model 
yielded a significant LRT result (Table 3), indicating that 
the (o ratio differs significantly between the Lake Tanganyi- 
kan cichlid RH2ap opsins (the p T partition) and the 
RH2aa and RH2ap opsins of other African cichlids (the 
ap M vR partition). Importantly, partitioning the tree in this 
manner did not eliminate the co ratio increase observed for 
African cichlid RH2a opsin clade compared to the outgroup 
orthologs (LRT against 'CmC p T '; Tables 2, 3). Finally, the 
divergent co ratio estimate for the Lake Tanganyikan RH2ap 
branches (the p T partition) was significantly greater than 
to = l (Table 4), indicating that the increase in co seen for 
the p T partition is due to site-specific positive selection and 
not simply relaxed functional constraint. Conversely, the di- 
vergent co ratio estimated for the ap MV R partition, though 
elevated, was not found to significantly exceed co = 1 
(Table 4). 

Reanalyzing the data under the CmD framework gave 
broadly similar results (compare Tables 2 and 3 with 
Additional file 1: Tables S2 and S3). First, parameter esti- 
mates were qualitatively similar between CmC and CmD 
analogues. Second, the rank order of CmC and CmD analo- 
gues was almost completely equivalent, with the only 
difference being the relative position of the two and 
three site-class null models (Mia and M2a_rel for CmC; 
M3 (K = 2) and M3 (K = 3) for CmD), which in both cases 
were very poor fits to the data. And, finally, P values for the 
various LRTs were generally similar, with only three tests 
providing qualitatively different results under the two 
model frameworks (one test achieved significance under 
CmD but not under CmC, and two others achieved signifi- 
cance under CmC but not CmD); notably, non-significant 



P values were still less than P = 0.10 for these each of these 
three cases. Notwithstanding these few differences, the 
broad similarity of AIC ranks, the majority of the LRTs, 
and the qualitative agreement in parameter estimates sug- 
gest that our CmC analyses have provided reliable infer- 
ences on cichlid RH2a opsin evolution. 

Eighty-one of the 343 total codons in the alignment 
were variable at the amino acid level and, of these, 27 
were identified by BEB analysis as members of the diver- 
gently evolving site class under the best fitting 'CmC 
apMVR & Pt model (PP > 0.75 for each site) (Table 5). 
Included among these identified sites are seven of the 10 
sites that substituted along the RH2aa and RH2ap post- 
duplication branches (Additional file 1: Table SI). The 
position of these sites within the opsin protein's second- 
ary and tertiary structure are shown in Figure 3. Most of 
these sites are situated within the opsin proteins extra- 
cellular half (Figure 3) and several are, or are adjacent 
to, known RH2 pigment spectral tuning sites (Table 5). 
Most prominent among these is site 122; all African 
cichlid species in our data set possess a glutamate resi- 
due (E122) at this site except for the Lake Tanganyikan 
species Ophthalmotilapia ventralis, which instead pos- 
sesses a glutamine residue (Q122). The E122Q substitu- 
tion is known to have a large effect on RH2 pigment 
spectral sensitivity, shifting X max to shorter wavelengths 
by -12-16 nm [78-80]. Moreover, this substitution is 
known to dramatically affect non-spectral properties of 
visual pigments (discussed below). Two other notable 
substitutions occurred along this same branch: F213V 
and G273V. Mutating site 273 has been shown to affect 
retinal uptake [76,81], while substitutions at site 213 are 
known to have slight effects on X max in zebrafish (Danio 
rerio) RH2 opsins [80]. Furthermore, substitutions at site 



Table 4 Likelihood ratio tests (LRTs) to determine whether foreground co estimates from the 'CmC ap MV R & Pt' model 
significantly differ from to = 1 

Foreground partition Unconstrained w estimate InL with constraint w = 1 P-value 

ap MVR (RH2aa and non-Tanganyikan RH2a(3 branches) 1.0919 -3774.0580 0.7166 

Pt tanganyikan RH2aP branches) 2.6122 -3777.2771 0.0104 



Both LRTs have 1 degree of freedom. 
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Table 5 Sites identified as divergently evolving by Bayes 
empirical Bayes inference under the 'CmC ap MV R & P/ 
Clade model 



Site 


PP 


Location 


Notes on Opsin Structure-Function 


22 


0.90 


I\|-t6rm 




24 


0.81 


N-term 




27 


0.91 


N-term 




31 


0.75 


N-term 




DO 


U.oo 


M farm / ThA 1 

i\i-ter m / i ivu 


RH2 spectral tuning site [80]. Situated on 
edge of retinal uptake/release channel [76]. 


56 


0.90 


TM1 




99 


0.91 


TM2 




107 


0.96 


E1 


Adjacent to RH2 spectral tuning site 
(site 108) [80]. 


109 


0.86 


E1 / TM3 


Adjacent to RH2 spectral tuning site 
(site 108) [80]. Adjacent to cysteine bond 
site (site 110) [82]. 


112 


0.80 


TM3 


RH2 spectral tuning site [80]. Adjacent to 
opsin counterion (site 113) [83]. 


122 


0.85 


TM3 


Major RH2 spectral tuning site [78]. Also 
influences G protein activation efficiency, 

ar-tix/p-ctatp rlpfsv r^fp ?inH i^l ninmpnt 
a^uvc jLaic ucL.ay icilc, ai in viouai piyiiiciii 

regeneration rate [84]. 


149 


0.77 


C2 / TM4 


Possible phosphorylation site [85,86]. 


158 


0.86 


TM4 




162 


0.94 


TM4 




165 


0.79 


TM4 


Possible RH2 spectral tuning site [80]. 


179 


0.88 


E2 




213 


0.75 


TM5 


RH2 spectral tuning site [80]. 


214 


0.89 


TM5 


Adjacent to RH2 spectral tuning site 
(site 213) [80]. 


218 


0.86 


TM5 


RH2 spectral tuning site [80]. 


263 


0.90 


TM6 




273 


0.85 


TM6 


Role in retinal uptake [76,81]. 


277 


0.90 


TM6 / E3 




282 


0.80 


E3 




284 


0.75 


E3 




290 


0.93 


TM7 




304 


0.97 


TM7 




335 


0.79 


C-term 


Possible phosphorylation site [85,86]. 



NOTE — Site numbering follows bovine RH1 opsin. Approximate location 
follows Sakmar et al. [34]. Abbreviations: N-term N-terminal tail, TM 
Transmembrane helix, E, Extracellular loop, C Cytoplasmic loop, C-term 
C-terminal tail. 

Only sites with posterior probability (PP) > 0.75 are shown. Underlined sites 
are those which substituted along branches within the Lake Tanganyikan 
cichlid RH2ap partition of the phylogeny. 

213 and 273 have been experimentally shown to influ- 
ence the decay rate of the activated visual pigment in 
zebrafish RH1 opsins (J.M. Morrow and B.S.W. Chang, 
unpublished data). Interestingly, these three sites (122, 
213, and 273) all substituted independently along the 
branch terminating with the Oryzies latipes (medaka) 



RH2C sequence, which suggests that some non-spectral 
opsin properties may be evolving convergently. 

A total of 178.5 amino acid changes were estimated to 
have occurred across the entire tree (given an alignment 
of 343 codons and a total tree length of 0.52032 amino 
acid substitutions per site, as estimated under the WAG 
+F+r amino acid substitution model). By reconstructing 
the ancestral amino acid residues at each node, we infer 
that 86 nonsynonymous changes, at minimum, occurred 
at the 27 BEB identified sites (mean ± SD = 3.19 ± 1.11 
amino acid changes per BEB site; range = 2-6). For 18 
of these 27 BEB sites, substitutions occurred along 
branches within the Lake Tanganyikan cichlid RH2a|3 
partition. Examining the changes at these sites in the 
context of the phylogeny reveals a few interesting pat- 
terns (Figure 4, Additional file 1: Table S4). First, BEB 
identified sites 273 and 277 both substituted along the 
branch terminating in the Ophthalmotilapia ventralis 
RH2a|3 opsin sequence; these two sites are situated ap- 
proximately one a helical turn apart in the opsins sixth 
transmembrane a helix, suggesting revolutionary 
change. Possibly, this pattern reflects compensatory evo- 
lution, as one of the substitutions (G273V) introduced a 
larger amino acid while the other (M277L) introduced a 
smaller one. Similarly, BEB identified sites 158 and 162 
are both situated one a helical turn apart in the fourth 
transmembrane a helix, and both substituted along the 
branch terminating in the Tropheus duboisi RH2a(3 opsin 
sequence. Again, we see one substitution introduce a lar- 
ger reside (L158F) and the other introduce a smaller 
residue (1 162V). These two sites both project outwards 
into a proposed opsin dimerization interface [87], and 
position 162 has been experimentally shown to contrib- 
ute to dimerization in the homologous dopamine D2 re- 
ceptor proteins [88]. Changes at these sites may thus 
influence dimerization strength, which, in turn, can in- 
fluence G protein binding [87]. Interestingly, these 
changes also occurred along the RH2aa post-duplication 
branch, suggesting not simply coevolution, but conver- 
gence as well. Finally, a cluster of BEB identified sites 
(sites 107, 109, and 112) found at the boundary between 
the opsins first extracellular loop and third transmem- 
brane a helix all substituted along the branch leading to 
the last common ancestor of the Ophthalmotilapia ven- 
tralis and Neolamprologus brichardi RH2a(3 opsin 
sequences. Site 112 is a known RH2 pigment spectral 
tuning site [80], while sites 107 and 109 surround an- 
other known RH2 spectral tuning site (site 108) [80]. 
The substitution at site 109 is particularly interesting, as 
the inferred change (F109S) is quite physicochemically 
severe, replacing a bulky, hydrophobic phenylalanine 
residue with a smaller, hydroxyl-bearing serine residue. 
More broadly, it can be seen that more sites known to 
affect spectral sensitivity (or that are adjacent to such 
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N-terminus (AA 1) 
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Figure 3 Opsin snake plot (a) and 3D model (b) showing sites inferred to be evolving under divergent selection pressure according to 
Bayes empirical Bayes analysis with the CmC oPmvr & Pt Clade model. Only sites with PP > 0.75 are shown, (a) The seven transmembrane a 
helixes (TM1-TM7), helix 8 (H8), and the extracellular (E1-E3) and cytoplasmic (C1-C3) loops are labeled. Every 10 th amino acid (starting with AA 
10) is shaded grey and drawn in thicker stroke, (b) The opsin backbone is shown as a gray ribbon, while the retinal chromophore is shown in 
black stick format. The 3D model is viewed with the extracellular side on top. For both (a) and (b) the protein structure and numbering follow 
bovine rhodopsin. 



sites) substitute within the RH2ap clade compared to the 
RH2aa clade. Most of this difference stems from 
changes along the branch ancestral to Ophthalmotilapia 
ventralis and Neolamprologus brichardi, and along the 
branch terminating in Ophthalmotilapia ventralis. 

Discussion 

Evolution after gene duplication is often characterized 
by some combination of relaxed selective constraint and 
the positively selected fixation of advantageous muta- 
tions, ultimately resulting in functional divergence 
among paralogs [3-5,89]. Consistent with these expecta- 
tions, our results (1) revealed a dramatic change in se- 
lective constraint following the African cichlid RH2a 
opsin duplication event and (2) identified the signature 
of divergent evolution at several amino acid sites of 
known functional importance in RH2 visual pigments. 
Clade model analyses revealed that, after duplication, the 
selective regime experienced by many alignment sites 
changed from weak purifying selection to either neutral 
evolution or weak positive selection. Interestingly, this 
switch in constraint applied to both duplicated clades 
relative to the outgroup orthologs, and this pattern was 
only detected once divergence among entire clades was 
considered but not when just the branches immediately 
following the duplication event were considered. 



While the patterns of evolution we observed in the fish 
RH2a opsin data set are consistent with a dramatic 
change in selective constraint following the African cich- 
lid RH2a opsin duplication event, it is not obvious which 
models of gene duplication are operating here, as the 
observed patterns do not neatly fit the predictions of 
most models. The adaptive and non-adaptive (Dykhuizen- 
Hartl) neofunctionalization models [3,90] both posit 
that one copy accumulates previously deleterious substi- 
tutions, potentially leading to the evolution of a new func- 
tion, while the other retains the ancestral function under 
a regime of purifying selection. These neofunctionaliza- 
tion models thus predict asymmetrical rates of evolution 
after duplication between paralogs, but our results indi- 
cate approximately equal shifts in selective constraint 
after duplication in both duplicates (with co changing 
from co ~ 0.5 before duplication to co ~ 1.1-1.5, after- 
wards). The segregation avoidance model [3,91] proposes 
that duplication may beneficially fix both alleles at loci 
harbouring balanced polymorphisms, thus eliminating 
costs associated with segregation load. This model 
predicts that functional divergence occurs among alleles 
before duplication, not long after the duplicate loci have 
fixed as we have found. The dosage model operates 
when possessing multiple loci provides a beneficial in- 
crease in gene product [3,92]; this model seems inappro- 
priate as well, as RH2ap opsin expression is generally 
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Figure 4 RH2a cladogram showing the inferred location of amino acid substitutions at the 27 'divergently evolving' sites listed in 
Table 5. Posterior probabilities of inferred ancestral amino acids are provided in Additional file 1: Table S4. Lines are drawn to indicate the RH2aa 
(dotted blue branches), RH2a(3 (thick red branches), and Lake Tanganyikan tree RH2a(3 partitions (dashed red branches), as in Figure 1. 



quite low in both Oreochromis niloticus and Lake Malawi 
haplochromine cichlids [42,44], and as the paralogs are 
known to be functionally divergent, contrary to model 
predictions. The popular duplication-degeneration-com- 
plementation model applies to multifunctional proteins, 
and describes a scenario by which each daughter protein 
neutrally loses one or more of the multiple functions 
such that both copies are then needed to perform all 
tasks [6]. This model could explain increases in co seen 
in both daughter clades, but this model seems unlikely 
to apply to opsins, at least at the protein coding level, as 
it is difficult to conceive of a way opsin protein bio- 
chemistry could be subfunctionalized. Opsin proteins 
have several measurable biochemical phenotypes (includ- 
ing \ max > active state stability, and regeneration rate), but 
proper visual pigment functioning requires an integrated 
protein for successful phototransduction. Subfunctionali- 
zation could occur at the regulatory level, however [93]. 

The 'gene sharing' model (also referred to as the 
specialization' or escape from adaptive conflict' model) 



[7,8] seems to be the most appropriate model for our 
cichlid opsin data set. If a single-copy protein's ability to 
efficiently serve two roles is compromised due to plei- 
otropy then, after duplication, each copy can adaptively 
specialize on one of the two roles (assuming both roles 
are suboptimal in the ancestor). The post-duplication co 
increases we observed for our RH2a data set may indi- 
cate weak positive selection in both paralogs, which 
would be consistent with this prediction. While this 
model is typically applied to multifunctional proteins 
that carry out two totally different roles (e.g., a structural 
role and an enzymatic role, as in a lens crystallins), it 
can also apply to proteins that perform a single bio- 
chemical task (e.g., a particular enzymatic reaction) if 
there is a benefit to having copies with subtly different 
rates or efficiencies [4], With regard to opsins, this could 
mean that a property such as X max diverged in both cop- 
ies compared to the ancestor, one to a longer wavelength 
and one to a shorter wavelength. Similarly, structural 
constraints may prevent co-optimization of different 
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aspects of the proteins overall function. Such predic- 
tions are testable through ancestral reconstruction and 
functional characterization of the single-copy ancestor. 
Of course, it should be noted that many of these models 
of gene duplicate retention and evolution can act in con- 
cert or subsequent to one another [94]. Indeed, the pat- 
terns of amino acid substitutions inferred along the 
RH2a|3 post-duplication branch are quite different than 
those inferred along the RH2aa post-duplication branch, 
with six highly-clustered sites substituting along the 
RH2a|3 branch (Additional file 1: Table SI). Changes at 
these sites could conceivably influence pigment regener- 
ation rate [76], and this may indicate that both neofunc- 
tionalization and specialization occurred following the 
RH2a duplication event in this system. Furthermore, 
here we have only considered the evolution of protein 
biochemistry, but these models of duplicate gene evolu- 
tion can also be considered with regard to gene expres- 
sion patterns. 

The fact that we uncovered among-lineage co variation 
when Clade models were used, but not when Branch- 
site models were used, opens the possibility that diver- 
gence among African cichlid RH2a opsins is also influ- 
enced by a collection of lineage-specific processes. 
Consistent with this hypothesis, we documented a large 
increase in co along Lake Tanganyikan cichlid RH2ap 
opsin branches that was above and beyond the increase 
already described for the RH2a|3 clade. The estimated co 
ratio for this tree partition was significantly above one, 
clearly indicating the action of positive selection. This 
finding is of note given that our initial focus was only on 
divergence associated with gene duplication, not diver- 
gence among orthologs. Our study thus serves as an ex- 
ample of how the evolution of duplicated genes can be 
driven by both paralog-specific and species-specific pro- 
cesses [11]. This point has practical importance as well. 
Many studies within the field of visual ecology assume 
functional equivalence among ortholgous pigments and 
model focal species' perceptual abilities using data from 
close relatives [95]. Our results suggest that such an ap- 
proach should only be applied tentatively for studies on 
cichlid visual ecology. 

At this point, it is difficult to say what factors are be- 
hind the positive selection operating along the Lake Tan- 
ganyikan cichlid RH2ap opsin lineages, as the visual 
niches inhabited by the three Lake Tanganyikan species 
included in our data set have surely evolved over the 
time-scale captured by our phylogeny. These three spe- 
cies inhabit distinct visual niches, varying in colour pat- 
terning, habitat depth, and diet [96], and these 
ecological differences could precipitate divergent selec- 
tion on opsin biochemistry and expression; the detection 
of divergent sexually selected courtship signals or food 
sources may select for divergence in A max > while vision 



under brighter or dimmer conditions could select for di- 
vergence in non-spectral, kinetic properties of visual pig- 
ments. Interestingly, some of the sites identified as 
positively selected along Lake Tanganyikan RH2a|3 opsin 
lineages are known to effect both spectral (i.e., X max ) and 
non-spectral attributes of visual pigments (Table 5). 
Most notably, the E122Q substitution, which occurred 
along the terminal branch leading to the Lake Tanganyi- 
kan cichlid Ophthalmotilapia ventralis, is known to in- 
crease the X max of RH2 pigments by a large amount 
(-12-16 nm) [78-80], and to affect a number of non- 
spectral, kinetic properties, including the photoactivated 
pigments decay rate, the efficiency with which the acti- 
vated pigment activates the downstream G protein, and 
the rate of visual pigment reformation following retinal 
release [84,97]. For each of these properties, experimen- 
tally adding the E122Q substitution to rod pigments 
produces mutant pigments that behave in a more cone- 
like manner (i.e., with a faster rate of active state decay 
and faster pigment regeneration). Ophthalmotilapia ven- 
tralis is known to generally reside at shallower depths 
than the other two Lake Tanganyikan cichlids in our 
data set (approximate depth range: O. ventralis 2-10 m; 
N. brichardi 5-30 m, T. duboisi 3-15 m) [96], where the 
visual environment is expected to be somewhat brighter. 
We thus see a compatible pattern between opsin mo- 
lecular evolution and comparative visual ecology in this 
system. Overall, our results suggest that the RH2a(3 
opsins play an important role in vision in at least some 
Lake Tanganyikan species, despite the fact that the 
RH2a|3 opsin has generally been found to be lowly 
expressed in other African cichlids. 

It is notable that we were only able to uncover among- 
lineage co variation once Clade models were employed, but 
not when the more popular Branch-site models were used. 
A number of factors are expected to affect the power of 
the Branch-site test [98], and it may be that future ana- 
lyses of larger data sets will yield different results, though 
we suspect otherwise, as substitutions occurred at several 
amino acids along both of the post- duplication branches. 
It appears that functional divergence simply occurred in a 
manner undetectable by the Branch-site methods. 
Branch-site models assume a very specific form of func- 
tional divergence— that is, a punctuated burst of adaptive 
sequence turnover— but functional divergence can instead 
manifest as variation in the overall strength of constraint 
or residue conservation between clades [99]. Most of the 
sites that substituted along the RH2aa and RH2aa post- 
duplication branches also substituted along other 
branches of the phylogeny, and such substitution patterns 
may not fit neatly within the site classes defined by the 
Branch-site models. The design of Clade models makes 
them better able to detect this alternative signature of 
functional divergence. Furthermore, the patterns we 
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uncovered were only detectable because the Clade model 
approach was recently expanded to allow for multiple 
foreground partitions [21], which allowed us to fit models 
that accommodate multiple shifts in the co ratio across the 
phytogeny. To date, only three other studies have 
employed this new approach: Yoshida et al [21] uncov- 
ered variation in the strength of positive selection affecting 
HIV env genes sampled from different decades; Wei and 
Ge [100] documented divergent selective constraint 
among duplicated MADS -box transcription factors in 
grasses; and, finally, Gossmann and Schmid [77] studied 
genome-wide patterns of divergence among duplicated 
Arabidopsis thaliana genes. We note that our study is the 
first, to our knowledge, to explicitly investigate divergence 
among both orthologs and paralogs in the same data set. 
Finally, it is noteworthy that several studies have used the 
results of Clade model analyses to support arguments of 
positive selection [101,102], but whether or not the rele- 
vant o) estimates significantly exceed co = 1 has not been 
explicitly tested, as we have done here; this point, while at 
first glance technical in nature, is of large importance, as 
co estimates larger than co = 1 may occur due to chance. 

In conclusion, our results are indicative of functional 
divergence among African cichlid RH2a opsins driven, at 
least in part, by positive selection. Combined with the 
insights of past studies, which indicate biochemical and 
expression differences among paralogs and among spe- 
cies, our results suggest that there is much to be learned 
by distinguishing among African cichlid RH2a opsin 
sequences, rather than grouping them together on ac- 
count of their high sequence similarity. Furthermore, 
these results provide a framework for mechanistic stud- 
ies of functional diversification among cichlid RH2a 
opsins, and help establish African cichlid RH2a opsins as 
a useful system for research on how function and linkage 
shape the evolution of young tandem duplicates [103]. 
Finally, our study adds to a growing body of research 
directed towards uncovering the molecular signature of 
diversification within the rapidly speciating African cich- 
lid clade. 
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